Formula for estimating GPU memory from the "B" that follows a model name

$ M=1.2(P/10^9)(Q/8\ \mathrm{bits})\ \mathrm{GB}

$ M：GPU memory expressed in Gigabyte

$ 1.2：Represents a 20% overhead of loading additional things in GPU memory.

$ P：The amount of parameters in the model. E.g. a 7B model has 7 billion parameters.

$ Q：The amount of bits that should be used for loading the model. E.g. 16 bits, 8 bits or 4 bits.

Example: Llama 70B with $ Q=16\ \mathrm{bits}

$ M=1.2\cdot70\cdot(16/8)\ \mathrm{GB}

$ =168\ \mathrm{GB}
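The same arithmetic as a small Python sketch, in case it is handy to have as a function. The function name `estimate_gpu_memory_gb` and its parameter names are made up for this note, and it assumes 1 GB = 10^9 bytes, as the $ P/10^9 term in the formula implies.

```python
def estimate_gpu_memory_gb(params: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate GPU memory in GB from M = 1.2 (P/10^9) (Q / 8 bits) GB.

    params   -- P, the raw parameter count (e.g. 70e9 for a 70B model)
    bits     -- Q, the bits used per parameter when loading (16, 8, 4, ...)
    overhead -- the 1.2 factor, i.e. a 20% allowance for everything else
    """
    if params <= 0 or bits <= 0:
        raise ValueError("params and bits must be positive")
    billions = params / 1e9      # P / 10^9
    bytes_per_param = bits / 8   # Q / 8 bits
    return overhead * billions * bytes_per_param


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = estimate_gpu_memory_gb(70e9, bits)
        print(f"70B at {bits} bits: about {gb:.0f} GB")
```

With `params=70e9` and `bits=16` this gives the 168 GB from the example above.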

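I don't know what the B means takker.icon

Apparently it is some kind of parameter

I'll look it up later

For reference mtane0412.icon

Thanks, that helps takker.icon

Reading the article to find out takker.icon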

$ M/\mathrm{GB}=1.2\frac{P/10^9\cdot Q\cdot4\ \mathrm{B}}{32\ \mathrm{bits\ B^{-1}}}

$ M：GPU memory expressed in Gigabyte

bytes

$ P：The amount of parameters in the model. E.g. a 7B model has 7 billion parameters.

So 7B = 7,000,000,000

dimensionless quantity

$ 4\ \mathrm{B}：4 bytes, expressing the bytes used for each parameter

bytes

$ 32：There are 32 bits in 4 bytes

$ Q：The amount of bits that should be used for loading the model. E.g. 16 bits, 8 bits or 4 bits.

bits

$ 1.2：Represents a 20% overhead of loading additional things in GPU memory.

dimensionless quantity

The dimensions don't seem to match takker.icon

Let me think it through with the example

$ M=\frac{P\cdot4\ \mathrm{B}}{32/Q}\cdot1.2 wogikaze.icon

Numerator: parameter count (dimensionless) * 4 (bytes) * 1.2 (dimensionless) = (bytes)

Denominator: 32 bits / Q (quantization) bits = (dimensionless)

So I think the result is in bytes wogikaze.icon

I see, so the $ 32 means $ 32\ \mathrm{bits} rather than $ 32\ \mathrm{bits\ B^{-1}} takker.icon

The article said "There are 32 bits in 4 bytes", so I misread it as $ 32\ \mathrm{bits}/4\ \mathrm{B}

This gives away that I can't read English

Tidied up, it comes to this takker.icon

$ M=1.2(P/10^9)(Q/8\ \mathrm{bits})\ \mathrm{GB}
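For the record, here is one way to get from the fraction form discussed above to this tidied form. It is only a sketch of the unit bookkeeping: it reads the $ 32 as $ 32\ \mathrm{bits} and assumes 1 GB = 10^9 B (decimal gigabytes).

```latex
\begin{align*}
M &= 1.2 \cdot \frac{P \cdot 4\ \mathrm{B}}{32\ \mathrm{bits} / Q} \\
  &= 1.2 \cdot P \cdot \frac{Q}{32\ \mathrm{bits}} \cdot 4\ \mathrm{B} \\
  &= 1.2 \cdot P \cdot \frac{Q}{8\ \mathrm{bits}}\ \mathrm{B} \\
  &= 1.2 \cdot \frac{P}{10^9} \cdot \frac{Q}{8\ \mathrm{bits}}\ \mathrm{GB}
\end{align*}
```

$ Q/8\ \mathrm{bits} is dimensionless (bytes per parameter as a pure number), so the result carries the unit of bytes and the dimensions now match.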

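Notes for when doing the proper calculation is too much trouble morisoba65536.icon

FP16/BF16 models: a little over twice the number, in GB (16 GB for 8B)

8-bit: roughly the number itself in GB (a bit over 8 GB for 8B)

4-bit quantization: half of that again

However, GGUF files tagged K_M keep some layers at 6 bits, and because the format internally processes parameters in units of 256, layers with fewer than 256 parameters are padded out in bf16, so they swell a little more

Figures like "8B" are approximations to begin with, so it is best to expect usage about 10% above the estimate. On Windows with an NVIDIA GPU, memory offloading lets you get away with overflowing a little (at the cost of speed)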
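The shortcut above as a rough Python sketch. The name `rough_gb` and the `margin` parameter are made up for this note. It assumes plain 16/8/4-bit weights and a flat 10% margin, and it does not model the extra K_M growth or offloading.

```python
def rough_gb(billions: float, bits: int, margin: float = 0.10) -> float:
    """Rule of thumb: (number in front of the B) * (bits / 8) GB, plus a margin.

    billions -- the number in the model name (8 for an 8B model)
    bits     -- bits per parameter (16 for FP16/BF16, 8, or 4)
    margin   -- assumed extra fraction, since "8B" is itself approximate
    """
    return billions * (bits / 8) * (1 + margin)


if __name__ == "__main__":
    for label, bits in (("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"8B {label}: about {rough_gb(8, bits):.1f} GB")
```

Passing `margin=0` gives the bare "twice the number / the number / half the number" figures.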